This RMarkdown file contains the report of the data analysis done for the project on forecasting daily bike rental demand using time series models in R. It contains analysis such as data exploration, summary statistics and building the time series models. The final report was completed on Wed Sep 11 11:49:41 2024.
Data Description:
This dataset contains the daily count of rental bike transactions between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.
Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Relevant Paper:
Fanaee-T, Hadi, and Gama, Joao, ‘Event labeling combining ensemble detectors and background knowledge’, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg
## Import required packages
## Loading and exploring the data
library(readr)
daily_data <- read_csv("C:/Users/srish/Downloads/day.csv")
## Rows: 731 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (15): instant, season, yr, mnth, holiday, weekday, workingday, weathers...
## date (1): dteday
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_data<- read_csv("C:/Users/srish/Downloads/hour.csv")
## Rows: 17379 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): instant, season, yr, mnth, hr, holiday, weekday, workingday, weat...
## date (1): dteday
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(daily_data)
## # A tibble: 6 × 16
## instant dteday season yr mnth holiday weekday workingday weathersit
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2011-01-01 1 0 1 0 6 0 2
## 2 2 2011-01-02 1 0 1 0 0 0 2
## 3 3 2011-01-03 1 0 1 0 1 1 1
## 4 4 2011-01-04 1 0 1 0 2 1 1
## 5 5 2011-01-05 1 0 1 0 3 1 1
## 6 6 2011-01-06 1 0 1 0 4 1 1
## # ℹ 7 more variables: temp <dbl>, atemp <dbl>, hum <dbl>, windspeed <dbl>,
## # casual <dbl>, registered <dbl>, cnt <dbl>
head(hourly_data)
## # A tibble: 6 × 17
## instant dteday season yr mnth hr holiday weekday workingday
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2011-01-01 1 0 1 0 0 6 0
## 2 2 2011-01-01 1 0 1 1 0 6 0
## 3 3 2011-01-01 1 0 1 2 0 6 0
## 4 4 2011-01-01 1 0 1 3 0 6 0
## 5 5 2011-01-01 1 0 1 4 0 6 0
## 6 6 2011-01-01 1 0 1 5 0 6 0
## # ℹ 8 more variables: weathersit <dbl>, temp <dbl>, atemp <dbl>, hum <dbl>,
## # windspeed <dbl>, casual <dbl>, registered <dbl>, cnt <dbl>
str(daily_data)
## spc_tbl_ [731 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ instant : num [1:731] 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Date[1:731], format: "2011-01-01" "2011-01-02" ...
## $ season : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : num [1:731] 6 0 1 2 3 4 5 6 0 1 ...
## $ workingday: num [1:731] 0 0 1 1 1 1 1 0 0 1 ...
## $ weathersit: num [1:731] 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num [1:731] 0.344 0.363 0.196 0.2 0.227 ...
## $ atemp : num [1:731] 0.364 0.354 0.189 0.212 0.229 ...
## $ hum : num [1:731] 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num [1:731] 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : num [1:731] 331 131 120 108 82 88 148 68 54 41 ...
## $ registered: num [1:731] 654 670 1229 1454 1518 ...
## $ cnt : num [1:731] 985 801 1349 1562 1600 ...
## - attr(*, "spec")=
## .. cols(
## .. instant = col_double(),
## .. dteday = col_date(format = ""),
## .. season = col_double(),
## .. yr = col_double(),
## .. mnth = col_double(),
## .. holiday = col_double(),
## .. weekday = col_double(),
## .. workingday = col_double(),
## .. weathersit = col_double(),
## .. temp = col_double(),
## .. atemp = col_double(),
## .. hum = col_double(),
## .. windspeed = col_double(),
## .. casual = col_double(),
## .. registered = col_double(),
## .. cnt = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(hourly_data)
## spc_tbl_ [17,379 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ instant : num [1:17379] 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Date[1:17379], format: "2011-01-01" "2011-01-01" ...
## $ season : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : num [1:17379] 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : num [1:17379] 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : num [1:17379] 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: num [1:17379] 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: num [1:17379] 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num [1:17379] 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num [1:17379] 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num [1:17379] 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num [1:17379] 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : num [1:17379] 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: num [1:17379] 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : num [1:17379] 16 40 32 13 1 1 2 3 8 14 ...
## - attr(*, "spec")=
## .. cols(
## .. instant = col_double(),
## .. dteday = col_date(format = ""),
## .. season = col_double(),
## .. yr = col_double(),
## .. mnth = col_double(),
## .. hr = col_double(),
## .. holiday = col_double(),
## .. weekday = col_double(),
## .. workingday = col_double(),
## .. weathersit = col_double(),
## .. temp = col_double(),
## .. atemp = col_double(),
## .. hum = col_double(),
## .. windspeed = col_double(),
## .. casual = col_double(),
## .. registered = col_double(),
## .. cnt = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(daily_data)
## instant dteday season yr
## Min. : 1.0 Min. :2011-01-01 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 1st Qu.:2011-07-02 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 Median :2012-01-01 Median :3.000 Median :1.0000
## Mean :366.0 Mean :2012-01-01 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 3rd Qu.:2012-07-01 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 Max. :2012-12-31 Max. :4.000 Max. :1.0000
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
## weathersit temp atemp hum
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
## windspeed casual registered cnt
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
summary(hourly_data)
## instant dteday season yr
## Min. : 1 Min. :2011-01-01 Min. :1.000 Min. :0.0000
## 1st Qu.: 4346 1st Qu.:2011-07-04 1st Qu.:2.000 1st Qu.:0.0000
## Median : 8690 Median :2012-01-02 Median :3.000 Median :1.0000
## Mean : 8690 Mean :2012-01-02 Mean :2.502 Mean :0.5026
## 3rd Qu.:13034 3rd Qu.:2012-07-02 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :17379 Max. :2012-12-31 Max. :4.000 Max. :1.0000
## mnth hr holiday weekday
## Min. : 1.000 Min. : 0.00 Min. :0.00000 Min. :0.000
## 1st Qu.: 4.000 1st Qu.: 6.00 1st Qu.:0.00000 1st Qu.:1.000
## Median : 7.000 Median :12.00 Median :0.00000 Median :3.000
## Mean : 6.538 Mean :11.55 Mean :0.02877 Mean :3.004
## 3rd Qu.:10.000 3rd Qu.:18.00 3rd Qu.:0.00000 3rd Qu.:5.000
## Max. :12.000 Max. :23.00 Max. :1.00000 Max. :6.000
## workingday weathersit temp atemp
## Min. :0.0000 Min. :1.000 Min. :0.020 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.340 1st Qu.:0.3333
## Median :1.0000 Median :1.000 Median :0.500 Median :0.4848
## Mean :0.6827 Mean :1.425 Mean :0.497 Mean :0.4758
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:0.660 3rd Qu.:0.6212
## Max. :1.0000 Max. :4.000 Max. :1.000 Max. :1.0000
## hum windspeed casual registered
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.0
## 1st Qu.:0.4800 1st Qu.:0.1045 1st Qu.: 4.00 1st Qu.: 34.0
## Median :0.6300 Median :0.1940 Median : 17.00 Median :115.0
## Mean :0.6272 Mean :0.1901 Mean : 35.68 Mean :153.8
## 3rd Qu.:0.7800 3rd Qu.:0.2537 3rd Qu.: 48.00 3rd Qu.:220.0
## Max. :1.0000 Max. :0.8507 Max. :367.00 Max. :886.0
## cnt
## Min. : 1.0
## 1st Qu.: 40.0
## Median :142.0
## Mean :189.5
## 3rd Qu.:281.0
## Max. :977.0
# Interactive Plot for Daily Data
library(timetk)
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
daily_data %>% plot_time_series(.date_var = dteday, .value = cnt, .interactive = TRUE, .plotly_slider = TRUE, .title = "Daily Bike Rentals", .x_lab = "Date", .y_lab = "Number of Rentals")
# Interactive Plot for Hourly Data
hourly_data %>% plot_time_series(.date_var = dteday, .value = cnt, .interactive = TRUE, .plotly_slider = TRUE, .title = "Hourly Bike Rentals", .x_lab = "Datetime", .y_lab = "Number of Rentals")
# calculating moving average of daily bike rentals
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
smooth_d <- ma(daily_data$cnt, order=7)
#creating a ggplot object
install.packages("ggplot2")
## Warning: package 'ggplot2' is in use and will not be installed
library(ggplot2)
ggplot(daily_data, aes(x= dteday, y= cnt)) + geom_line(color = "blue") + geom_line(aes(y= smooth_d), color= "red") + labs(title= "Original and Smoothed Daily Bike Rentals", x= "Date", y= "Number of Rentals") + theme_classic()
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_line()`).
# creating a time series object
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
daily_ts <- ts(daily_data$cnt, start = c(2011, 1, 1), frequency = 365)
daily_dates <- seq.Date(from = as.Date("2011-01-01"), by = "day", length.out = length(daily_data$cnt))
# decomposing daily_ts into seasonal, trend, residual components
decomp <- stl(daily_ts, s.window = "periodic")
# plotting
plot(decomp)
# running the ADF Test to assess stationarity
library(tseries)
adf.test(daily_ts) # results = not stationary
##
## Augmented Dickey-Fuller Test
##
## data: daily_ts
## Dickey-Fuller = -1.6351, Lag order = 9, p-value = 0.7327
## alternative hypothesis: stationary
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
monthly_ts <- aggregate(daily_data$cnt, by = list(format(daily_dates, "%Y-%m")), FUN = sum)
# making it stationary
## log transformation
monthly_ts_log <- log(monthly_ts$x)
adf.test(monthly_ts_log) ###not stationary enough
##
## Augmented Dickey-Fuller Test
##
## data: monthly_ts_log
## Dickey-Fuller = -3.3434, Lag order = 2, p-value = 0.08563
## alternative hypothesis: stationary
## differencing
monthly_ts_diff <- diff(monthly_ts$x)
adf.test(monthly_ts_diff) ###not stationary enough
##
## Augmented Dickey-Fuller Test
##
## data: monthly_ts_diff
## Dickey-Fuller = -2.001, Lag order = 2, p-value = 0.572
## alternative hypothesis: stationary
## combining methods
monthly_ts_diff_log <- diff(log(monthly_ts$x))
adf.test(monthly_ts_diff_log)
##
## Augmented Dickey-Fuller Test
##
## data: monthly_ts_diff_log
## Dickey-Fuller = -2.4962, Lag order = 2, p-value = 0.3834
## alternative hypothesis: stationary
# splitting the data into training and testing sets
set.seed(123)
train_index <- sample(nrow(daily_data), 0.8*nrow(daily_data))
train_data <- daily_data[train_index, ]
test_data <- daily_data[-train_index, ]
# building an ARIMA model
library(forecast)
arima_model <- auto.arima(train_data$cnt, ic= "bic")
# evaluate the model using test data
## loading stats package
library(stats)
## defining p, d, q
p <- 1 # AR order
q <- 1 # differencing order
q <- 1 # MA order
arima_model <- arima(train_data$cnt, order = c(1, 1, 1))
summary(arima_model)
##
## Call:
## arima(x = train_data$cnt, order = c(1, 1, 1))
##
## Coefficients:
## ar1 ma1
## -0.0638 -1.0000
## s.e. 0.0414 0.0048
##
## sigma^2 estimated as 3655586: log likelihood = -5235.56, aic = 10477.12
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -34.18823 1910.321 1559.299 -69.81913 92.5421 0.6893145
## ACF1
## Training set 0.001001772
Task 1: Open and Examine the Data After the hourly and daily data were successfully loaded and examined, a clear seasonal pattern was found in the data.
Task 2: Create Interactive Time Series Plots: To better understand the structure of the data, interactive plots were made to visualize the hourly and daily data.
Task 3: Smooth Time Series Data: The underlying trend was highlighted in the daily data by using the moving average smoothing technique.
Task 4: Decompose and assess Time Series Data Stationarity: Once the seasonal, trend, and residual components of the data were separated out, the ADF test showed that the data was not stationary. The data became stationary after differencing and log transformation were applied.
Task Five: Using ARIMA Models to Fit and Forecast Time Series Data: After being developed and assessed, an ARIMA model was found to fit the data well.
Given the significant seasonal and trend components in the data, the outcomes were generally what was anticipated. These patterns were well captured by the ARIMA model, and the forecasting outcomes were plausible.
Overall, this project showed the value of using R’s time series analysis tools and offered insightful analysis of the daily bike rental demand dataset.